Assignment 2

Jasper Phang Wee Keat (A0201523Y)


Research Question

In this assignment, we will investigate if socio-economic factors such as Age, Ethnicity, Income deprivation and Education deprivation are correlated with different consumer food choices. We will be using the Tesco Grocery 1.0 dataset, LSOA Atlas dataset and English Indices of Deprivation dataset.


Data Sources

Tesco Grocery

The Tesco Grocery 1.0 dataset can be downloaded at Figshare and is detailed in this open access paper (Aiello et al. 2020).

The paper provided a short description of all the fields in the dataset provided.

  • area_id | Identifier of the area
  • weight | Weight of the average food product, in grams
  • volume | Volume of the average drink product, in liters
  • energy | Nutritional energy of the average product, in kcals
  • energy_density | Concentration of calories in the area’s average product, in kcals/gram
  • nutrient. | Weight of nutrient in the average product, in grams. Possible nutrients are: carbs, sugar, fat, saturated fat, protein, fibre. The count of carbs include sugars and the count of fats includes saturated fats.
  • energy_nutrient | Amount of energy from nutrient in the average product, in kcals
  • h_nutrients_weight | Entropy of nutrients weight
  • h_nutrients_weight_norm | Entropy of nutrients weight, normalized in [0,1]
  • h_nutrients_calories | Entropy of energy from nutrients
  • h_nutrients_calories_norm | Entropy of energy from nutrients, normalized in [0,1]
  • f_category | Fraction of products of type purchased. Possible categories are: beer, dairy, eggs, fats & oils, fish, fruit & veg, grains, red meat, poultry, readymade, sauces, soft drinks, spirits, sweets, tea & coffee, water, and wine.
  • f_category_weight | Fraction of total product weight given by products of type category
  • h_category | Entropy of food product types
  • h_category_norm. | Entropy of food product categories, normalized in [0,1].
  • h_category_weight | Entropy of weight of food product categories
  • h_category_weight_norm | Entropy of weight of food product categories, normalized in [0,1].
  • representativeness_norm | The ratio between the number of unique customers in the area and the number of residents as measured by the census; values are min-max normalized in [0,1] across all areas
  • transaction_days | Number of unique dates in which at least one purchase has been made by one of the residents in the area.
  • num_transactions | Total number of products purchased by Clubcard owners who are resident in the area
  • man_day Cumulative number of man-days of purchase (number of distinct days a customer has purchased something, summed all individual customers)
  • population | Total population of residents in the area according to the 2015 census
  • male | Total male population in the area
  • female | Total female population in the area
  • age_0_17 | Total number of residents between 0 and 17 years old
  • age_18_64 | Total number of residents between 18 and 64 years old
  • age_65 + | Total number of residents aged 65 years or more
  • avg_age | Average age of residents according to the 2015 census
  • area_sq_km | Surface of the area (km2)
  • people_per_sq_km | Population density per km2

We will be focusing on the Age Group, Food Categories and Food Nurients data for each LSOA.

Lower Super Output Area (LSOA)

The LSOA 2011 boundaries last updated in 2020 by the Greater London Authority (GLA) can be downloaded here

The LSOA Atlas dataset last updated in 2015 by the Greater London Authority (GLA) provides a summary of demographic and related data for each Lower Super Output Area in Greater London. It provides data on the population, diversity, households, health, housing, crime,benefits, land use, deprivation, schools, and employment of each small area.

We will be focusing on the Ethnic Group data for each LSOA.

English Indices of Deprivation 2015

The English Indices of Deprivation (ID) data for London, last updated in 2020 by the Ministry of Housing, Communities & Local Government (MHCLG) can be downloaded here. Instead of the 2019 data, We will be using the 2015 data as it is closer to the Tesco dataset.

The ID data considers Income Deprivation, Employment Deprivation, Education, Skills and Training Deprivation, Health Deprivation and Disability, Crime, Barriers to Housing and Services, and Living Environment Deprivation.

We will be focusing on Income Deprivation and Education Deprivation for each LSOA.

Required packages

library(tidyverse)
library(ggthemes)
library(sf)
library(tmap)
library(corrplot)
library(stringr)
library(DT)
library(ggrepel)

Importing Data

Tesco Grocery Dataset

We verify that the grocery dataset has all the fields mentioned in the open access paper. It includes data about the energy, nutrients and categories of grocery purchases and when these purchases were made. Also, it contains demographic data of the LSOA.

grocery_LSOA <- read_csv("data_tesco/area_level_grocery_purchases/year_lsoa_grocery.csv")
grocery_LSOA <- rename(grocery_LSOA, age_65_above = 'age_65+')

datatable(head(grocery_LSOA))

LSOA Boundaries

We will be using the Statistical GIS Boundary for LSOA as at 2011.

LSOA_shp <- read_sf('data_tesco/statistical-gis-boundaries-london/ESRI/LSOA_2011_London_gen_MHW.shp')

tm_shape(LSOA_shp) +
  tm_borders(col='gray') +
  tm_layout(frame=FALSE) +
  tm_credits("Data: Greater London Authority (2020)", position = c("left","bottom"))

LSOA Atlas

We have a wide range of data to choose from in the LSOA Atlas, including population, diversity, households, health, housing, crime, etc.

atlas_LSOA <- read_csv("data_tesco/lsoa-data.csv")
datatable(head(atlas_LSOA))

LSOA Indices of Deprivation

The Indices of Deprivation 2015 provide a set of relative measures of deprivation in 32,844 LSOAs across England, based on seven domains of deprivation. The domains were combined using the following weights to produce the overall Index of Multiple Deprivation:

  • Income Deprivation (22.5%)
  • Employment Deprivation (22.5%)
  • Education, Skills and Training Deprivation (13.5%)
  • Health Deprivation and Disability (13.5%)
  • Crime (9.3%)
  • Barriers to Housing and Services (9.3%)
  • Living Environment Deprivation (9.3%)

The Index ranks every LSOA from 1 (most deprived area) to 32,844 (least deprived area). The 32,844 LSOAs have been divided according to their deprivation rank into 10 equal groups (deciles).

We will be using the income and education deprivation data find out if it has any impact of food choices. These indices are more comprehensive compared to the median household income or house prices from the LSOA Atlas dataset above. Also, both data for the indices and Tesco grocery data are collected in 2015.

ID_LSOA <- read_csv("data_tesco/ID 2015 for London.csv")
datatable(head(ID_LSOA))

Exploratory Analysis

According to the open access paper, we can finetune the desired level of general population’s representativeness. We will verify that a representativeness of at least 0.1 (10% of the residents represented) will leave us with 80% the total number of area.

grocery_LSOA_rep <- grocery_LSOA %>%
                  select(area_id, representativeness_norm) %>%
                  filter(representativeness_norm > 0.1)
datatable(grocery_LSOA_rep, rownames = FALSE)
nrow(grocery_LSOA_rep) / nrow(grocery_LSOA)
## [1] 0.8015725

Representativeness of Data

Full Dataset

LSOA_shp_full <- inner_join(LSOA_shp, grocery_LSOA, by = c('LSOA11CD'='area_id'))

tm_shape(LSOA_shp_full) +
  tm_polygons('representativeness_norm', palette='BuPu', border.col='white', 
              border.alpha = 0.5, breaks = c(seq(0,1, by=0.1)), 
              title='Representativeness') +
  tm_layout (main.title = '% of Population Represented', legend.position = c('right','bottom'),
             legend.text.size = 0.8, frame=FALSE) +
  tm_credits("Data: Greater London Authority (2020), Aiello, L. M., Schifanella, R., Quercia, D., & Del Prete, L.. (2020)",
             position = c("left","bottom"))

At least 0.1 representativeness

LSOA_shp_rep <- left_join(LSOA_shp, grocery_LSOA_rep, by = c('LSOA11CD'='area_id'))

tm_shape(LSOA_shp_rep) +
  tm_polygons('representativeness_norm', palette='BuPu', border.col='white', 
              border.alpha = 0.5, breaks = c(seq(0,1, by=0.1)), 
              title='Representativeness', textNA = 'Not Representative') +
  tm_layout (main.title = '% of Population Represented', legend.position = c('right','bottom'),
             legend.text.size = 0.8, frame=FALSE) +
  tm_credits("Data: Greater London Authority (2020), Aiello, L. M., Schifanella, R., Quercia, D., & Del Prete, L.. (2020)",
             position = c("left","bottom"))

We verified that the customers in certain areas are not representative of the general population.

Below is a map from the open access paper reflecting the stores’ positions (dots) and the number of grocery transactions at LSOA level (Aiello et al. 2020). We can see that Tesco stores are not uniformly distributed across the city, hence an uneven distribution of penetration rates is observed across the city.


Distribution of Purchases

Food Category

grocery_LSOA %>%  
  filter(representativeness_norm > 0.1) %>%
  select ("f_beer", "f_dairy", "f_eggs", "f_fats_oils",
          "f_fish","f_fruit_veg", "f_grains",
          "f_meat_red","f_poultry","f_readymade", "f_sauces",
          "f_soft_drinks","f_spirits", "f_sweets",
          "f_tea_coffee","f_water","f_wine") %>% 
          gather() %>%
  mutate(key = str_replace_all(key, "f_", "")) %>%
  mutate(key = str_replace_all(key, "_", " ")) %>% 
  ggplot(aes(value)) +
    facet_wrap(~ key, scales = 'free', nrow = 5) +
    geom_histogram(aes(y=..density..), color = 'black',fill='grey90') +
    geom_density(alpha=0.01, fill='white') +
  theme_minimal() +
  labs(title='Distribution of Food Categories', x='', y="")

Food Nutrients

grocery_LSOA %>%  
  filter(representativeness_norm > 0.1) %>%
  select ("fat","saturate", "salt", "sugar", 
          "protein", "carb",    "fibre",    "alcohol") %>% 
          gather() %>%
  ggplot(aes(value)) +
    facet_wrap(~ key, scales = 'free', nrow = 2) +
    geom_histogram(aes(y=stat(density)), color = 'black',fill='grey90') +
    geom_density(alpha=0.01, fill='white') +
    theme_minimal() +
    labs(title='Distribution of Food Nutrients', x='', y="")

Results

1 | Impact of Age on food choices

1.1 | Age Group - Food Category

We will first focus on the impacts of Age Groups on food category choices.

According to Azaïs-Braesco et al. (2017), total and added sugar intakes are high in the European countries studied, particularly in children, pointing to sweet products and beverages as the predominant sources of added sugar intakes.

We hypothesize a positive correlation between LSOAs with higher share of children and sweets & soft drinks purchases.

grocery_LSOA %>% 
  mutate(age_0_17 = age_0_17/population,
  age_18_64 = age_18_64/population,
  age_65_above = age_65_above/population) -> grocery_LSOA
grocery_category <- grocery_LSOA %>%
                    filter(representativeness_norm > 0.1) %>% 
                    select ("age_0_17","age_18_64","age_65_above",
                            "f_beer", "f_dairy", "f_eggs", "f_fats_oils", "f_fish","f_fruit_veg",
                            "f_grains","f_meat_red","f_poultry","f_readymade", "f_sauces",
                            "f_soft_drinks","f_spirits", "f_sweets","f_tea_coffee","f_water","f_wine")
names(grocery_category) <- str_replace_all(names(grocery_category), "f_", "")

datatable(grocery_category, rownames = FALSE)
corr_category <- cor(grocery_category)
corrplot(corr_category, method='square', type='upper', title = 'Age Group - Food Category',
          mar=c(0,0,3,0), addCoef.col = "black", number.cex=0.9, diag=FALSE, tl.srt=45) 

Age 0-17 - Soft Drinks

ggplot(grocery_category, aes(x=age_0_17, y=soft_drinks)) +
  geom_point(alpha = 0.3) +
  geom_smooth(method='lm', color='blue') +
  theme_minimal() +
  labs(x = 'Age 0 to 17', y = 'Soft Drinks', 
       title = 'Share of Children in LSOA vs Soft Drink purchases',
       caption = 'Data: Aiello, L. M., Schifanella, R., Quercia, D., & Del Prete, L.. (2020)') +
    annotate(geom='text',x=0.45,y=0.3,label='R = 0.26', color='blue')

Age 0-17 - Sweets

ggplot(grocery_category, aes(x=age_0_17, y=sweets)) +
  geom_point(alpha = 0.3) +
  geom_smooth(method='lm', color='blue') +
  theme_minimal() +
  labs(x = 'Age 0 to 17', y = 'Sweets', 
       title = 'Share of Children in LSOA vs Sweet purchases',
       caption = 'Data: Aiello, L. M., Schifanella, R., Quercia, D., & Del Prete, L.. (2020)') +
    annotate(geom='text',x=0.45,y=0.3,label='R = 0.38', color='blue')

Age 18-64 - Sweets

ggplot(grocery_category, aes(x=age_18_64, y=sweets)) +
  geom_point(alpha=0.3) +
  geom_smooth(method='lm') +
  theme_minimal()  +
  labs(x = 'Age 18 to 64', y = 'Sweets', 
       title = 'Share of Adults in LSOA vs Sweet purchases',
       caption = 'Data: Aiello, L. M., Schifanella, R., Quercia, D., & Del Prete, L.. (2020)') +
    annotate(geom='text',x=0.875,y=0.3,label='R = -0.38', color='blue')

Observation
We observe that LSOAs with higher share of children have weak positive correlation to soft drink purchases (0.26) and sweet purchases (0.38).
While LSOAs with higher share of Adults have weak negative correlation to sweet purchases (-0.38).

1.2 | Age Group - Food Nutrients

We will then focus on the impacts of Age Groups on food nutrients.
We hypothesize a positive correlation between LSOAs with higher proportion of children and higher sugar contents in purchases.

grocery_nutrients <- grocery_LSOA %>%
                    filter(representativeness_norm > 0.1) %>% 
                    select ("age_0_17","age_18_64","age_65_above",
                            "fat","saturate",   "salt", "sugar", 
                            "protein", "carb",  "fibre",    "alcohol")

datatable(grocery_nutrients, rownames = FALSE)
corr_nutrients <- cor(grocery_nutrients)
corrplot(corr_nutrients, method='square', type='upper', title = 'Age Group - Food Nutrients',
          mar=c(0,0,3,0), addCoef.col = "black", diag=FALSE, tl.srt=45)

Age 0-17 - Sugar

ggplot(grocery_nutrients, aes(x=age_0_17, y=sugar)) +
  geom_point(alpha=0.3) +
  geom_smooth(method='lm') +
  theme_minimal() +
  labs(x = 'Age 0 to 17', y = 'Average Sugar', 
       title = 'Share of Children in LSOA vs Sugar in Food purchases',
       caption = 'Data: Aiello, L. M., Schifanella, R., Quercia, D., & Del Prete, L.. (2020)') +
    annotate(geom='text',x=0.45,y=18,label='R = 0.38', color='blue')

Age 18-46 - Sugar

ggplot(grocery_nutrients, aes(x=age_18_64, y=sugar)) +
  geom_point(alpha=0.3) +
  geom_smooth(method='lm') +
  theme_minimal() +
  labs(x = 'Age 18 to 64', y = 'Average Sugar', 
       title = 'Share of Adults in LSOA vs Sugar in Food purchases',
       caption = 'Data: Aiello, L. M., Schifanella, R., Quercia, D., & Del Prete, L.. (2020)') +
    annotate(geom='text',x=0.875,y=18,label='R = -0.42', color='blue')

Observation
We observe that LSOAs with higher share of children have weak positive correlation to higher sugar contents in purchases (0.38).
While LSOAs with higher share of Adults have moderate negative correlation to higher sugar contents in purchases (-0.42).

1.3 | Age Group - Weight of Purchases

We will then focus on the impacts of Age Groups on weight of food purchases.

According to Drewnowski and Shultz (2001), daily volume of foods and beverages declines as a function of age.

We hypothesize a negative correlation between LSOAs with higher proportion of seniors and weight of food purchases.

Age 65 and above - Weight

cor(grocery_LSOA$weight, grocery_LSOA$age_65_above)
## [1] -0.3164854
grocery_LSOA %>% 
  select(weight, volume, age_65_above) %>% 
ggplot(aes(x=age_65_above, y=weight)) +
geom_point(alpha=0.3)+
geom_smooth(method='lm') +
theme_minimal() +
labs(x = 'Age 65 and Above', y = 'Average Weight of Purchase', 
     title = 'Share of Seniors in LSOA vs Weight of Food purchases',
     caption = 'Data: Aiello, L. M., Schifanella, R., Quercia, D., & Del Prete, L.. (2020)') +
  annotate(geom='text',x=0.4,y=600,label='R = -0.32', color='blue')

Age 0-17 - Weight

cor(grocery_LSOA$weight, grocery_LSOA$age_0_17)
## [1] 0.3026058
grocery_LSOA %>% 
  select(weight, volume, age_0_17) %>% 
ggplot(aes(x=age_0_17, y=weight)) +
geom_point(alpha=0.3)+
geom_smooth(method='lm') +
theme_minimal() +
labs(x = 'Age 0 to 17', y = 'Average Weight of Purchase', 
     title = 'Share of Children in LSOA vs Weight of Food purchases',
     caption = 'Data: Aiello, L. M., Schifanella, R., Quercia, D., & Del Prete, L.. (2020)') +
  annotate(geom='text',x=0.4,y=600,label='R = 0.30', color='blue')

Observation
We observe that LSOAs with higher share of seniors have a weak negative correlation to average weight of food purchases.(-0.32).
We also observed that LSOAs with higher share of children have a weak positive correlation to average weight of food purchases. (0.30).


2 | Impact of Ethnicity of food choices

We will first visualise the geographical distribution of different ethnic groups in London.

ethnicity <- atlas_LSOA %>%
                select ("Lower Super Output Area",
                        "Ethnic Group;White (%);2011",
                        "Ethnic Group;Mixed/multiple ethnic groups (%);2011",
                        "Ethnic Group;Asian/Asian British (%);2011",
                        "Ethnic Group;Black/African/Caribbean/Black British (%);2011",
                        "Ethnic Group;Other ethnic group (%);2011") %>%
                rename(area_id = "Lower Super Output Area",
                       white = "Ethnic Group;White (%);2011",
                       mixed = "Ethnic Group;Mixed/multiple ethnic groups (%);2011",
                       asian = "Ethnic Group;Asian/Asian British (%);2011",
                       black = "Ethnic Group;Black/African/Caribbean/Black British (%);2011",
                       others = "Ethnic Group;Other ethnic group (%);2011")
datatable(ethnicity, rownames = FALSE)
summary(ethnicity)
##    area_id              white           mixed            asian      
##  Length:4835        Min.   : 3.50   Min.   : 0.600   Min.   : 0.80  
##  Class :character   1st Qu.:45.90   1st Qu.: 3.500   1st Qu.: 7.30  
##  Mode  :character   Median :63.10   Median : 4.800   Median :12.00  
##                     Mean   :60.71   Mean   : 4.923   Mean   :17.93  
##                     3rd Qu.:77.60   3rd Qu.: 6.200   3rd Qu.:21.90  
##                     Max.   :98.20   Max.   :14.400   Max.   :86.90  
##      black           others      
##  Min.   : 0.10   Min.   : 0.000  
##  1st Qu.: 4.30   1st Qu.: 1.500  
##  Median : 9.50   Median : 2.700  
##  Mean   :13.05   Mean   : 3.381  
##  3rd Qu.:18.90   3rd Qu.: 4.500  
##  Max.   :63.70   Max.   :36.600
LSOA_shp_ethnic <- left_join(LSOA_shp, ethnicity, by = c('LSOA11CD'='area_id'))

We will be focusing on the 3 main ethnic groups: White, Asian and Black.

Geographical Distribution of Ethnic Groups

White

tm_shape(LSOA_shp_ethnic) +
  tm_polygons('white', palette='GnBu', border.col='white', border.alpha = 0.5,
              breaks = c(seq(0,100, by=10)), title='White (%)', legend.is.portrait = FALSE) +
  tm_layout (main.title = 'Ethnic Groups by LSOA', legend.outside = TRUE,
            legend.outside.position = "bottom", legend.text.size = 0.8, frame=FALSE) +
  tm_credits("Data: Greater London Authority (2015,2020)", position = c("left","bottom"))

Asian

tm_shape(LSOA_shp_ethnic) +
  tm_polygons('asian', palette='GnBu', border.col='white', border.alpha = 0.5,
              breaks = c(seq(0,100, by=10)), title='Asian (%)', legend.is.portrait = FALSE) +
  tm_layout (main.title = 'Ethnic Groups by LSOA', legend.outside = TRUE,
            legend.outside.position = "bottom", legend.text.size = 0.8, frame=FALSE) +
  tm_credits("Data: Greater London Authority (2015,2020)", position = c("left","bottom"))

Black

tm_shape(LSOA_shp_ethnic) +
  tm_polygons('black', palette='GnBu', border.col='white', border.alpha = 0.5,
              breaks = c(seq(0,100, by=10)), title='Black (%)', legend.is.portrait = FALSE) +
  tm_layout (main.title = 'Ethnic Groups by LSOA', legend.outside = TRUE,
            legend.outside.position = "bottom", legend.text.size = 0.8, frame=FALSE) +
  tm_credits("Data: Greater London Authority (2015,2020)", position = c("left","bottom"))

Before we begin, we first consider the areas with a minimum of 0.1 representativeness and check if we have sufficient representation of both predominantly white and non-white areas.

LSOA_rep <- grocery_LSOA %>%
            filter(representativeness_norm > 0.1) %>% 
            select(area_id)
ethnicity_rep <- left_join(LSOA_rep, ethnicity) %>% 
                  select('area_id', 'white', 'asian','black')%>% 
                  mutate(majority = case_when(white >= 50 ~ 'white',
                                              TRUE ~ 'non_white'))
table(ethnicity_rep$majority)
## 
## non_white     white 
##      1348      2526
LSOA_shp_ethnic_rep <- left_join(LSOA_shp, ethnicity_rep, by = c('LSOA11CD'='area_id'))

tm_shape(LSOA_shp_ethnic_rep) +
  tm_polygons('majority', palette='GnBu', border.col='white', border.alpha = 0.5,
              breaks = c(seq(0,100, by=10)), title='White vs Non-White',
              textNA = "Not Representative", labels = c("Non-White", "White"),
              legend.is.portrait = FALSE) +
    tm_layout (main.title = 'Predominant Ethnicity by LSOA', legend.outside = TRUE,
            legend.outside.position = "bottom", legend.text.size = 0.8, frame=FALSE) +
  tm_credits("Data: Greater London Authority (2015,2020)", position = c("left","bottom"))

2.1 | Ethnicity - Food Category

We will first focus on the impacts of Ethnic Groups on food category choices.

According to Stewart et al. (2021), Asian and Asian British consumed the least meat when compared to White, Black, Black British and Others.

We hypothesize a negative correlation between LSOAs with higher share of Asians and meat purchases.

grocery_ethnicity <- grocery_LSOA %>%
                    filter(representativeness_norm > 0.1) %>% 
                    select ("area_id","f_beer", "f_dairy", "f_eggs", "f_fats_oils",
                            "f_fish","f_fruit_veg", "f_grains","f_meat_red",
                            "f_poultry","f_readymade", "f_sauces",
                            "f_soft_drinks","f_spirits","f_sweets",
                            "f_tea_coffee","f_water","f_wine")
names(grocery_ethnicity) <- str_replace_all(names(grocery_ethnicity), "f_", "")

grocery_ethnicity <- left_join(ethnicity_rep, grocery_ethnicity)

datatable(grocery_ethnicity, rownames = FALSE)
corr_ethnicity <- cor(select(grocery_ethnicity,-area_id,-majority))
corrplot(corr_ethnicity, method='square', type='upper', title = 'Ethnicity - Food Category',
          mar=c(0,0,3,0), addCoef.col = "black", number.cex=0.9, diag=FALSE, tl.srt=45)

Asian - Red Meat

ggplot(grocery_ethnicity, aes(x=asian, y=meat_red)) +
  geom_point(aes(color=majority), alpha=0.5) +
  scale_color_hue(labels = c("Non-White", "White")) +
  geom_smooth(method='lm', color='blue') +
  theme_minimal() +
  labs(x = 'Asian', y = 'Red Meat', color = 'Majority', 
       title = 'Share of Asian in LSOA vs Red Meat purchases',
       caption = 'Data: Greater London Authority (2015), Aiello, L. M., Schifanella, R., Quercia, D., & Del Prete, L.. (2020)') +
    annotate(geom='text',x=75,y=0.08,label='R = -0.46', color='blue')

Asian - Poultry

ggplot(grocery_ethnicity, aes(x=asian, y=poultry)) +
  geom_point(aes(color=majority), alpha=0.5) +
  scale_color_hue(labels = c("Non-White", "White")) +
  geom_smooth(method='lm', color='blue') +
  theme_minimal() +
  labs(x = 'Asian', y = 'Poultry', color = 'Majority', 
       title = 'Share of Asian in LSOA vs Poultry purchases',
       caption = 'Data: Greater London Authority (2015), Aiello, L. M., Schifanella, R., Quercia, D., & Del Prete, L.. (2020)') +
    annotate(geom='text',x=75,y=0.04,label='R = -0.38', color='blue')

Observation
We observe that LSOAs with higher share of Asian have moderate negative correlation to red meat purchases (-0.46) and weak negative correlation to poultry purchases (-0.38).

2.2 | Ethnicity - Food Nutrients

We will now focus on the impacts of Ethnic Groups on food nutrients.

According to Goff et al. (2013), black African-Caribbean groups consumed a lower intake of saturated fat compared to white- European populations.

We hypothesize a positive correlation between LSOAs with higher share of Blacks/Africans/Caribbeans and LSOAs with lower saturated fat contents in food purchases.

grocery_ethnicity_nutrients <- grocery_LSOA %>%
                    filter(representativeness_norm > 0.1) %>% 
                    select ("area_id", "fat","saturate",    "salt", "sugar", 
                            "protein", "carb",  "fibre",    "alcohol")

grocery_ethnicity_nutrients <- left_join(ethnicity_rep, grocery_ethnicity_nutrients)

datatable(grocery_ethnicity_nutrients, rownames = FALSE)
corr_ethnicity <- cor(select(grocery_ethnicity_nutrients,-area_id, -majority))
corrplot(corr_ethnicity, method='square', type='upper', title = 'Ethnicity - Food Nutrients',
          mar=c(0,0,3,0), addCoef.col = "black", number.cex=0.9, diag=FALSE, tl.srt=45)

Black - Saturated Fat

ggplot(grocery_ethnicity_nutrients, aes(x=black, y=saturate)) +
  geom_point(aes(color=majority), alpha=0.5) +
  scale_color_hue(labels = c("Non-White", "White")) +
  geom_smooth(method='lm', color='blue') +
  theme_minimal() +
  labs(x = 'Black/African/Caribbean', y = 'Saturated Fat', color = 'Majority', 
       title = 'Share of Black/African/Caribbean in LSOA vs Saturated Fat in Food purchases',
       caption = 'Data: Greater London Authority (2015), Aiello, L. M., Schifanella, R., Quercia, D., & Del Prete, L.. (2020)') +
    annotate(geom='text',x=55,y=5,label='R = -0.18', color='blue')

Asian - Protein

ggplot(grocery_ethnicity_nutrients, aes(x=asian, y=protein)) +
  geom_point(aes(color=majority), alpha=0.5) +
  scale_color_hue(labels = c("Non-White", "White")) +
  geom_smooth(method='lm', color='blue') +
  theme_minimal() +
  labs(x = 'Asian', y = 'Protein', color = 'Majority', 
       title = 'Share of Asian in LSOA vs Protein in Food purchases',
       caption = 'Data: Greater London Authority (2015), Aiello, L. M., Schifanella, R., Quercia, D., & Del Prete, L.. (2020)') +
    annotate(geom='text',x=75,y=6,label='R = -0.53', color='blue')

Observation
We observe that LSOAs with higher share of Black/African/Caribbean have very weak negative correlation to saturated fat content in food purchases (-0.18).
Also, LSOAs with higher share of Asians have moderate negative correlation to protein content in food purchases (-0.53).


3 | Impact of Deprivation on food choices

We will first visualise the geographical distribution of deprivation in London.

indices_decile <- ID_LSOA %>%
                select ("LSOA code (2011)",
                        "IMD Decile (where 1 is most deprived 10% of LSOAs)",
                        "Income Decile (where 1 is most deprived 10% of LSOAs)",
                        "Education, Skills and Training Decile (where 1 is most deprived 10% of LSOAs)") %>%
                rename(area_id = "LSOA code (2011)",
                       IMD_decile = "IMD Decile (where 1 is most deprived 10% of LSOAs)",
                       income_decile = "Income Decile (where 1 is most deprived 10% of LSOAs)",
                       education_decile = "Education, Skills and Training Decile (where 1 is most deprived 10% of LSOAs)") %>% 
                mutate_if(is.numeric,as.factor)

datatable(indices_decile, rownames = FALSE)
LSOA_shp_decile <- left_join(LSOA_shp, indices_decile, by = c('LSOA11CD'='area_id'))

Deprivation Deciles of Population

IMD (Index of Multiple Deprivation)

tm_shape(LSOA_shp_decile) +
  tm_polygons('IMD_decile', palette='-GnBu', border.col='white', border.alpha = 0.5,
              title='IMD Decile', 
              labels = c("1 (Most Deprived)", "2", "3", "4", "5", "6", "7", "8", "9", "10 (Least Deprived)")) +
  tm_layout (main.title = 'Deprivation Decile by LSOA', 
            legend.position = c('right','bottom'), frame=FALSE) +
  tm_credits("Data: Greater London Authority (2020), Ministry of Housing, Communities & Local Government (2020)", position = c("left","bottom"))

Income Deprivation

tm_shape(LSOA_shp_decile) +
  tm_polygons('income_decile', palette='-GnBu', border.col='white', border.alpha = 0.5,
              title='Income Decile',
              labels = c("1 (Most Deprived)", "2", "3", "4", "5", "6", "7", "8", "9", "10 (Least Deprived)")) +
  tm_layout (main.title = 'Deprivation Decile by LSOA', 
            legend.position = c('right','bottom'), frame=FALSE) +
  tm_credits("Data: Greater London Authority (2020), Ministry of Housing, Communities & Local Government (2020)", position = c("left","bottom"))

Education Deprivation

tm_shape(LSOA_shp_decile) +
  tm_polygons('education_decile', palette='-GnBu', border.col='white', border.alpha = 0.5,
              title='Education Decile',
              labels = c("1 (Most Deprived)", "2", "3", "4", "5", "6", "7", "8", "9", "10 (Least Deprived)")) +
  tm_layout (main.title = 'Deprivation Decile by LSOA', 
            legend.position = c('right','bottom'), frame=FALSE) +
  tm_credits("Data: Greater London Authority (2020), Ministry of Housing, Communities & Local Government (2020)", position = c("left","bottom"))

3 | Deprivation - Food Category

We will first focus on the impacts of Income Deprivation and Education Deprivation on food category choices.

According to Pechey et al. (2015), Darmon and Drewnowski (2008), fresh fruits and vegetables are more likely to be consumed by groups of higher socioeconomic status (SES). SES can be measured by occupation, income or education.

We hypothesize a positive correlation between less deprived LSOAs and fruits/vegetable purchases. We will be plotting the data using the ranking and decile of each LSOA.

grocery_dep <- grocery_LSOA %>%
                    filter(representativeness_norm > 0.1) %>% 
                    select ("area_id","f_beer", "f_dairy", "f_eggs", "f_fats_oils",
                            "f_fish","f_fruit_veg", "f_grains","f_meat_red",
                            "f_poultry","f_readymade", "f_sauces",
                            "f_soft_drinks","f_spirits","f_sweets",
                            "f_tea_coffee","f_water","f_wine")
names(grocery_dep) <- str_replace_all(names(grocery_dep), "f_", "")
indices_rank <- ID_LSOA %>%
                select ("LSOA code (2011)",
                        "IMD Rank (where 1 is most deprived)",
                        "Income Rank (where 1 is most deprived)",
                        "Education, Skills and Training Rank (where 1 is most deprived)") %>%
                rename(area_id = "LSOA code (2011)",
                       IMD = "IMD Rank (where 1 is most deprived)",
                       income = "Income Rank (where 1 is most deprived)",
                       education = "Education, Skills and Training Rank (where 1 is most deprived)")

grocery_rank <- right_join(indices_rank, grocery_dep)
indices_decile <- ID_LSOA %>%
                select ("LSOA code (2011)",
                        "IMD Decile (where 1 is most deprived 10% of LSOAs)",
                        "Income Decile (where 1 is most deprived 10% of LSOAs)",
                        "Education, Skills and Training Decile (where 1 is most deprived 10% of LSOAs)") %>%
                rename(area_id = "LSOA code (2011)",
                       IMD = "IMD Decile (where 1 is most deprived 10% of LSOAs)",
                       income = "Income Decile (where 1 is most deprived 10% of LSOAs)",
                       education = "Education, Skills and Training Decile (where 1 is most deprived 10% of LSOAs)")

grocery_decile <- right_join(indices_decile, grocery_dep)

grocery_decile$income <- factor(grocery_decile$income)
grocery_decile$education <- factor(grocery_decile$education)

datatable(grocery_decile, rownames = FALSE)
corr_dep <- cor(select(grocery_rank,-area_id))
corrplot(corr_dep, method='square', type='upper', title = 'Deprivation - Food Category',
          mar=c(0,0,3,0), addCoef.col = "black", number.cex=0.9, diag=FALSE, tl.srt=45)

Income Deprivation - Fruits & Vegetables

ggplot(grocery_rank, aes(x=income, y=fruit_veg)) +
geom_point(alpha=0.3)+
geom_smooth(method='lm') +
theme_minimal() +
labs(x = 'Income Rank', y = 'Fruits and Vegetable', 
     title = 'Income Deprivation vs Fruits and Vegetable purchases',
     caption = 'Data: Ministry of Housing, Communities & Local Government (2020),\nAiello, L. M., Schifanella, R., Quercia, D., & Del Prete, L.. (2020)') +
  annotate(geom='text',x=30000,y=0.5,label='R = 0.42', color='blue') +
scale_x_continuous(breaks = c(5000, 10000, 15000, 20000, 25000, 30000), labels = c("(Most Deprived)", "10000", "15000", "20000", "25000", "(Least Deprived)"))

ggplot(grocery_decile, aes(x=income, y=fruit_veg)) +
geom_boxplot(alpha=0.3)+
theme_minimal() +
labs(x = 'Income Decile', y = 'Fruits and Vegetables', 
     title = 'Income Deprivation vs Fruits and Vegetable purchases',
     caption = 'Data: Ministry of Housing, Communities & Local Government (2020),\nAiello, L. M., Schifanella, R., Quercia, D., & Del Prete, L.. (2020)') +
scale_x_discrete(labels = c("1\n\n(Most Deprived)", "2", "3", "4", "5", "6", "7", "8", "9", "10\n\n(Least Deprived)"))

Education Deprivation - Fruits & Vegetables

ggplot(grocery_rank, aes(x=education, y=fruit_veg)) +
geom_point(alpha=0.3)+
geom_smooth(method='lm') +
theme_minimal() +
labs(x = 'Education Rank', y = 'Fruits and Vegetable', 
     title = 'Education Deprivation vs Fruits and Vegetable purchases',
     caption = 'Data: Ministry of Housing, Communities & Local Government (2020),\nAiello, L. M., Schifanella, R., Quercia, D., & Del Prete, L.. (2020)') +
  annotate(geom='text',x=30000,y=0.5,label='R = 0.56', color='blue') +
  scale_x_continuous(breaks = c(5000, 10000, 15000, 20000, 25000, 30000), labels = c("(Most Deprived)", "10000", "15000", "20000", "25000", "(Least Deprived)"))

ggplot(grocery_decile, aes(x=education, y=fruit_veg)) +
geom_boxplot(alpha=0.3)+
theme_minimal() +
labs(x = 'Education Decile', y = 'Fruits and Vegetables', 
     title = 'Education Deprivation vs Fruits and Vegetable purchases',
     caption = 'Data: Ministry of Housing, Communities & Local Government (2020),\nAiello, L. M., Schifanella, R., Quercia, D., & Del Prete, L.. (2020)') +
scale_x_discrete(labels = c("1\n\n(Most Deprived)", "2", "3", "4", "5", "6", "7", "8", "9", "10\n\n(Least Deprived)"))

Observation
We observe that less income/education deprived LSOAs have a moderate positive correlation to fruits and vegetable purchases (0.42 / 0.56)

We also observe that less deprived LSOAs have a negative correlation to soft drinks and sweet purchases. It suggest that these items are less likely to be consumed by groups of higher SES.

Income - Soft Drinks

outlier <- grocery_rank %>% 
          filter(soft_drinks > 0.3) %>% 
          select(area_id,income,education,soft_drinks)
ggplot(grocery_rank, aes(x=income, y=soft_drinks)) +
geom_point(alpha=0.3)+
geom_point(data = outlier, aes(x=income, y=soft_drinks), color='red') +
geom_text_repel(data = outlier, label='Outlier') +
geom_smooth(method='lm') +
theme_minimal() +
labs(x = 'Income Rank', y = 'Soft Drinks', 
     title = 'Income Deprivation vs Soft Drinks purchases',
     caption = 'Data: Ministry of Housing, Communities & Local Government (2020),\nAiello, L. M., Schifanella, R., Quercia, D., & Del Prete, L.. (2020)') +
  annotate(geom='text',x=30000,y=0.3,label='R = -0.3', color='blue') +
scale_x_continuous(breaks = c(5000, 10000, 15000, 20000, 25000, 30000), labels = c("(Most Deprived)", "10000", "15000", "20000", "25000", "(Least Deprived)"))

ggplot(grocery_decile, aes(x=income, y=soft_drinks)) +
geom_boxplot(alpha=0.3)+
theme_minimal() +
labs(x = 'Income Decile', y = 'Soft Drinks', 
     title = 'Income Deprivation vs Soft Drinks purchases',
     caption = 'Data: Ministry of Housing, Communities & Local Government (2020),\nAiello, L. M., Schifanella, R., Quercia, D., & Del Prete, L.. (2020)') +
scale_x_discrete(labels = c("1\n\n(Most Deprived)", "2", "3", "4", "5", "6", "7", "8", "9", "10\n\n(Least Deprived)")) +
ylim(0,0.2)

For the boxplot, We have omitted the outlier shown in the scatter plot.

Education - Soft Drinks

ggplot(grocery_rank, aes(x=education, y=soft_drinks)) +
geom_point(alpha=0.3)+
geom_point(data = outlier, aes(x=education, y=soft_drinks), color='red') +
geom_text_repel(data = outlier, label='Outlier') +
geom_smooth(method='lm') +
theme_minimal() +
labs(x = 'Education Rank', y = 'Soft Drinks', 
     title = 'Education Deprivation vs Soft Drinks purchases',
     caption = 'Data: Ministry of Housing, Communities & Local Government (2020),\nAiello, L. M., Schifanella, R., Quercia, D., & Del Prete, L.. (2020)') +
  annotate(geom='text',x=30000,y=0.3,label='R = -0.3', color='blue') +
scale_x_continuous(breaks = c(5000, 10000, 15000, 20000, 25000, 30000), labels = c("(Most Deprived)", "10000", "15000", "20000", "25000", "(Least Deprived)"))

ggplot(grocery_decile, aes(x=education, y=soft_drinks)) +
geom_boxplot(alpha=0.3)+
theme_minimal() +
labs(x = 'Education Decile', y = 'Soft Drinks', 
     title = 'Education Deprivation vs Soft Drinks purchases',
     caption = 'Data: Ministry of Housing, Communities & Local Government (2020),\nAiello, L. M., Schifanella, R., Quercia, D., & Del Prete, L.. (2020)') +
scale_x_discrete(labels = c("1\n\n(Most Deprived)", "2", "3", "4", "5", "6", "7", "8", "9", "10\n\n(Least Deprived)")) +
ylim(0,0.2)

For the boxplot, We have omitted the outlier shown in the scatter plot.

Income - Sweets

ggplot(grocery_rank, aes(x=income, y=sweets)) +
geom_point(alpha=0.3)+
geom_smooth(method='lm') +
theme_minimal() +
labs(x = 'Income Rank', y = 'Sweets', 
     title = 'Income Deprivation vs Sweets purchases',
     caption = 'Data: Ministry of Housing, Communities & Local Government (2020),\nAiello, L. M., Schifanella, R., Quercia, D., & Del Prete, L.. (2020)') +
  annotate(geom='text',x=30000,y=0.3,label='R = -0.22', color='blue') +
scale_x_continuous(breaks = c(5000, 10000, 15000, 20000, 25000, 30000), labels = c("(Most Deprived)", "10000", "15000", "20000", "25000", "(Least Deprived)"))

ggplot(grocery_decile, aes(x=income, y=sweets)) +
geom_boxplot(alpha=0.3)+
theme_minimal() +
labs(x = 'Income Decile', y = 'Sweets', 
     title = 'Income Deprivation vs Sweets purchases',
     caption = 'Data: Ministry of Housing, Communities & Local Government (2020),\nAiello, L. M., Schifanella, R., Quercia, D., & Del Prete, L.. (2020)') +
scale_x_discrete(labels = c("1\n\n(Most Deprived)", "2", "3", "4", "5", "6", "7", "8", "9", "10\n\n(Least Deprived)"))

Education - Sweets

ggplot(grocery_rank, aes(x=education, y=sweets)) +
geom_point(alpha=0.3)+
geom_smooth(method='lm') +
theme_minimal() +
labs(x = 'Education Rank', y = 'Sweets', 
     title = 'Education Deprivation vs Sweets purchases',
     caption = 'Data: Ministry of Housing, Communities & Local Government (2020),\nAiello, L. M., Schifanella, R., Quercia, D., & Del Prete, L.. (2020)') +
  annotate(geom='text',x=30000,y=0.3,label='R = -0.41', color='blue') +
scale_x_continuous(breaks = c(5000, 10000, 15000, 20000, 25000, 30000), labels = c("(Most Deprived)", "10000", "15000", "20000", "25000", "(Least Deprived)"))

ggplot(grocery_decile, aes(x=education, y=sweets)) +
geom_boxplot(alpha=0.3)+
theme_minimal() +
labs(x = 'Education Decile', y = 'Sweets', 
     title = 'Education Deprivation vs Sweets purchases',
     caption = 'Data: Ministry of Housing, Communities & Local Government (2020),\nAiello, L. M., Schifanella, R., Quercia, D., & Del Prete, L.. (2020)') +
scale_x_discrete(labels = c("1\n\n(Most Deprived)", "2", "3", "4", "5", "6", "7", "8", "9", "10\n\n(Least Deprived)"))

Observation
We observe that less income/education deprived LSOAs have a weak negative correlation to soft drink purchases (-0.3 / -0.38).
Also, less income deprived LSOA have a very weak negative correlation to sweet purchases (-0.22) while less education deprived LSOAs have a moderate negative correlation to sweet purchases (-0.41).


Conclusion

Summmary

All of our observations have verified findings from other papers albeit with varying levels of associations. In summary, we can infer the following from our analysis:

  • Age
    • (0.26) Weak postive correlation of children with soft drinks, and sweet purchases.
    • (-0.38) Weak negative correlation of adults with sweet purchases.
    • (0.38) Weak postive correlation of children with sugar in purchases.
    • (-0.42) Moderate negative correlation of adult with sugar in purchases.
    • (-0.32) Weak negative correlation of seniors with weight of purchases.
    • (0.30) Weak positive correlation of children with weight of purchases.

Children purchase more sugar and products like soft drinks and sweets.
Seniors purchase lower volume of foods while children consume larger volume of foods. this is likely due to physiological changes that occur with ageing which impair appetite amongst older people (Pilgrim et al. (2015)).

  • Ethnicity
    • (-0.46) Moderate negative correlation of Asian with red meat purchases.
    • (-0.38) Weak negative correlation of Asian with poultry purchases.
    • (-0.53) Moderate negative correlation of Asian with protein in purchases.
    • (-0.18) Very Weak negative correlation of Blacks with saturated fat in purchases.

Asians purchase less meat and less protein. This is likely due to predominantly plant-based traditional diets, usually comprising of rice, vegatables and soy (Hill (2019)).

  • Deprivation
    • (0.42) Moderate positive correlation of income with fruits/vegetable purchases.
    • (0.56) Moderate positive correlation of education with fruits/vegetable purchases.
    • (-0.30) Weak negative correlation of income with soft drinks purchases.
    • (-0.38) Weak negative correlation of education with soft drinks purchases.
    • (-0.22) Very weak negative correlation of income with sweets purchases.
    • (-0.41) Moderate negative correlation of income with sweets purchases.

Higher SES purchase more fruits and vegetables, and less soft drinks and sweets. Interestingly, education has a higher correlation when compared to income. We can interpret that healthier food purchases are associated more with educated consumers than purchasing power.

Limitations / Bias

We have also identified the following limitations and biases of the data.

  • The sample of people whose purchases are represented in this dataset is huge, but not random, because it represents a group of people who choose to buy at Tesco and sign up for a Clubcard membership on their own.

  • The grocery data does not provide us with sufficient information about the purchaser (level of individuals), we only have data of the LSOA (level of geography). Additionally, the number and type of people consuming the food purchased is also unknown. (e.g singles/families).

  • The ethnicity data was collected in 2011 which may not have been fully representative of Tesco grocery purchases recorded in 2015 as the geographic ethnic distribution may have changed over the years.


References

Aiello, Luca Maria, Daniele Quercia, Rossano Schifanella, and Lucia Del Prete. 2020. “Tesco Grocery 1.0, a Large-Scale Dataset of Grocery Purchases in London.” Scientific Data 7 (February). https://doi.org/10.1038/s41597-020-0397-7.
Azaïs-Braesco, Véronique, Diewertje Sluik, Matthieu Maillot, Frans Kok, and Luis A. Moreno. 2017. “A Review of Total & Added Sugar Intakes and Dietary Sources in Europe.” Nutrition Journal 16 (January). https://doi.org/10.1186/s12937-016-0225-2.
Darmon, Nicole, and Adam Drewnowski. 2008. “Does Social Class Predict Diet Quality?” The American Journal of Clinical Nutrition 87 (May): 1107–17. https://doi.org/10.1093/ajcn/87.5.1107.
Drewnowski, A., and J. M. Shultz. 2001. “Impact of Aging on Eating Behaviors, Food Choices, Nutrition, and Health Status.” The Journal of Nutrition, Health & Aging 5: 75–79. https://pubmed.ncbi.nlm.nih.gov/11426286/.
Goff, Louise M, Bruce A Griffin, Julie A Lovegrove, Tom A Sanders, Susan A Jebb, Les J Bluck, and Gary S Frost. 2013. “Ethnic Differences in Beta-Cell Function, Dietary Intake and Expression of the Metabolic Syndrome Among UK Adults of South Asian, Black African-Caribbean and White-European Origin at High Risk of Metabolic Syndrome.” Diabetes and Vascular Disease Research 10 (January): 315–23. https://doi.org/10.1177/1479164112467545.
Hill, Kayla. 2019. “Are Asian Diets Looking Fattier and Meatier Than Ever Before?” South China Morning Post. https://www.scmp.com/magazines/style/travel-food/article/3020071/do-asians-crave-fatty-sodium-filled-burgers-and-fried.
Pechey, Rachel, Pablo Monsivais, Yin-Lam Ng, and Theresa M. Marteau. 2015. “Why Don’t Poor Men Eat Fruit? Socioeconomic Differences in Motivations for Fruit Consumption.” Appetite 84 (January): 271–79. https://doi.org/10.1016/j.appet.2014.10.022.
Pilgrim, Anna L, Sian M Robinson, Avan Aihie Sayer, and Helen C Roberts. 2015. “An Overview of Appetite Decline in Older People.” Nursing Older People 27 (May): 29–35. https://doi.org/10.7748/nop.27.5.29.e697.
Stewart, Cristina, Carmen Piernas, Brian Cook, and Susan A. Jebb. 2021. “Trends in UK Meat Consumption: Analysis of Data from Years 1–11 (2008–09 to 2018–19) of the National Diet and Nutrition Survey Rolling Programme.” The Lancet Planetary Health 5 (October): e699–708. https://doi.org/10.1016/S2542-5196(21)00228-X.